Abstract

How many labeled examples are needed to estimate a classifier's performance
on a new dataset? We study the case where data is plentiful, but labels are
expensive. We show that by making a few reasonable assumptions on the
structure of the data, it is possible to estimate performance curves, with
confidence bounds, using a small number of ground truth labels. Our approach,
which we call Semisupervised Performance Evaluation (SPE), is based on a
generative model for the classifier's confidence scores. In addition to
estimating the performance of classifiers on new datasets, SPE can be used to
recalibrate a classifier by re-estimating the class-conditional confidence
distributions.

1 Introduction

Consider a biologist who downloads software for classifying the behavior of
fruit flies. The classifier was laboriously trained by a different research
group who labeled thousands of training examples to achieve satisfactory
performance on a validation set collected in some particular setting (see
e.g. [5]). The biologist would be ill-advised if she trusted the published
performance figures; maybe small lighting changes in her experimental setting
have changed the statistics of the data and rendered the classifier useless.
However, if the biologist has to review all the labels assigned by the
classifier to her dataset, just to be sure the classifier is performing up to
expectation, then what is the point of obtaining a trained classifier in the
first place? Is it possible at all to obtain a reliable evaluation of a
classifier when unlabeled data is plentiful, but when the user is willing to
provide only a small number of labeled examples?

We propose a method for achieving minimally supervised evaluation of
classifiers, requiring as few as 10 labels to accurately estimate classifier
performance. Our method is based on a generative Bayesian model for the
confidence scores produced by the classifier, borrowing from the literature on
semisupervised learning [TU |3U1 HI]. We show how to use the model to
recalibrate classifiers to new datasets by choosing thresholds to satisfy
performance constraints with high likelihood. An additional contribution is a
fast approximate inference method for our model.

Figure 1: Estimating detector performance with all but 10 labels unknown. A:
Histogram of classifier scores $s_i$ obtained by running the "ChnFtrs"
detector [8] on the INRIA dataset [5]. The red and green curves show the
Gamma-Normal mixture model fitting the histogrammed scores with highest
likelihood. The scores are all unlabeled, apart from 10, selected at random,
which have labels. The shaded bands indicate the 90% probability bands around
the model. The red and green bars show the labels of the 10 randomly sampled
items (by chance, the scores for some of the samples are close to each other,
thus only 6 bars are shown; the height of the bars has no meaning). B:
Precision and recall curves computed from the mixture model in A. C: In black,
precision-recall curve computed after all items have been labeled. In red,
precision-recall curve estimated using SPE from only 10 labeled examples (with
90% confidence interval shown as the magenta band). See Section 2 for a
discussion.

2 Modeling the classifier score 

Let us start with a set of $N$ data items, $(x_i, y_i) \in \mathbb{R}^D \times
\{0, 1\}$, drawn from some unknown distribution $p(x, y)$ and indexed by
$i \in \{1, \ldots, N\}$. Suppose that a classifier, $h(x_i; \tau) =
[h(x_i) > \tau]$, where $\tau$ is some scalar threshold, has been used to
classify all data items into two classes, $\hat{y}_i \in \{0, 1\}$. While the
"ground truth" labels $y_i$ are assumed to be unknown, initially, we do have
access to all the "scores," $s_i = h(x_i)$, computed by the classifier. From
this point onwards, we forget about the data vectors $x_i$ and concentrate
solely on the scores and labels, $(s_i, y_i) \in \mathbb{R} \times \{0, 1\}$.

The key assumption in this paper is that the list of scores $S = (s_1, \ldots,
s_N)$ and the unknown labels $Y = (y_1, \ldots, y_N)$ can be modeled by a
two-component mixture model $p(S, Y \mid \theta)$, parameterized by $\theta$,
where the class-conditionals are standard parametric distributions. We show in
Section 4.2 that this is a reasonable assumption for many datasets.

Suppose that we can ask an expert (the "oracle") to provide the true label
$y_i$ for any data item. This is an expensive operation and our goal is to ask
the oracle for as few labels as possible. The set of items that have been
labeled by the oracle at time $t$ is denoted by $\mathcal{L}_t$ and its
complement, the set of items for which the ground truth is unknown, is denoted
$\mathcal{U}_t$. This setting is similar to semisupervised learning [20, 21].
By estimating $p(S, Y \mid \theta)$, we will improve our estimate of the
performance of $h$ when $|\mathcal{L}_t| \ll N$.

Consider first the fully supervised case, i.e. where all labels $y_i$ are
known. Let the scores $s_i$ be i.i.d. according to the two-component mixture
model. If all labels are known, and we assume independent observations, the
likelihood of the data is given by

$$p(S, Y \mid \theta) = \prod_{i : y_i = 0} (1 - \pi)\, p_0(s_i \mid \theta_0) \prod_{i : y_i = 1} \pi\, p_1(s_i \mid \theta_1), \tag{1}$$

where $\theta = \{\pi, \theta_0, \theta_1\}$, and $\pi \in [0, 1]$ is the
mixture weight, i.e. $p(y_i = 1) = \pi$.
The component densities $p_0$ and $p_1$ could be modeled parametrically by
Normal distributions, Gamma distributions, or some other probability
distributions appropriate for the given classifier (see Section 4.2 for a
discussion about which class-conditional distributions to choose). This
approach of applying a generative model to score distributions, when all
labels are known, has been used in the past to obtain error estimates on
classifier performance [T31 EH H] and for classifier calibration [1 1.
However, previous approaches require that all the items used to estimate the
performance have been labeled.
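As a worked form of equation (1), taking logarithms turns the two products
into sums. The counts $n_0$ and $n_1$ (the number of items with $y_i = 0$ and
$y_i = 1$, so that $n_0 + n_1 = N$) are notation introduced here for
illustration only; they do not appear in the original text.

```latex
\begin{align*}
\log p(S, Y \mid \theta)
  &= n_0 \log(1 - \pi) + n_1 \log \pi
   + \sum_{i : y_i = 0} \log p_0(s_i \mid \theta_0)
   + \sum_{i : y_i = 1} \log p_1(s_i \mid \theta_1), \\
\hat{\pi}_{\mathrm{ML}} &= \frac{n_1}{n_0 + n_1} = \frac{n_1}{N}.
\end{align*}
```

In this fully labeled form the mixture weight and the two sets of component
parameters appear in separate terms, so each can be fitted on its own; the
maximum-likelihood mixture weight is simply the fraction of positive items.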

We suggest that it may be possible to estimate classifier performance even
when only a fraction of the ground truth labels are known. In this case, the
labels for the unlabeled items $i \in \mathcal{U}_t$ can be marginalized out,

$$p(S, Y_t \mid \theta) = \prod_{i \in \mathcal{U}_t} \big( (1 - \pi)\, p_0(s_i \mid \theta_0) + \pi\, p_1(s_i \mid \theta_1) \big) \tag{2}$$

$$\qquad \times \prod_{i \in \mathcal{L}_t} \pi^{y_i} (1 - \pi)^{1 - y_i}\, p_{y_i}(s_i \mid \theta_{y_i}), \tag{3}$$

where $Y_t = \{y_i \mid i \in \mathcal{L}_t\}$. This allows the model to make
use of the scores of unlabeled items in addition to the labeled items, which
enables accurate performance estimates with only a handful of labels. Once we
have the likelihood, we can take a Bayesian approach to estimate the
parameters $\theta$. Starting from a prior on the parameters, $p(\theta)$, we
can obtain a posterior $p(\theta \mid S, Y_t)$ by using Bayes' rule,

$$p(\theta \mid S, Y_t) \propto p(S, Y_t \mid \theta)\, p(\theta). \tag{4}$$
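The likelihood in (2)-(3) and the unnormalized posterior in (4) can be
sketched in a few lines. This is a minimal illustration, not the authors'
implementation: it assumes plain Normal class-conditionals (the experiments
later use truncated Normal and Gamma densities), marks unlabeled items with
`None`, and the function names and the `log_prior` callback are hypothetical.

```python
import math


def normal_logpdf(s, mu, sigma):
    """Log density of a Normal(mu, sigma) evaluated at s."""
    z = (s - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def log_likelihood(scores, labels, pi, theta0, theta1):
    """Log of equations (2)-(3).

    labels[i] is 0 or 1 for items labeled by the oracle and None for
    unlabeled items. theta0 and theta1 are (mu, sigma) pairs; the Normal
    components are an assumption made for this sketch.
    """
    total = 0.0
    for s, y in zip(scores, labels):
        l0 = math.log(1.0 - pi) + normal_logpdf(s, *theta0)
        l1 = math.log(pi) + normal_logpdf(s, *theta1)
        if y is None:
            # Unlabeled item: marginalize the label (log-sum-exp of l0, l1).
            m = max(l0, l1)
            total += m + math.log(math.exp(l0 - m) + math.exp(l1 - m))
        else:
            # Labeled item: only the component of the observed class counts.
            total += l1 if y == 1 else l0
    return total


def log_posterior(scores, labels, pi, theta0, theta1, log_prior):
    """Unnormalized log of equation (4): log likelihood plus log prior."""
    return (log_likelihood(scores, labels, pi, theta0, theta1)
            + log_prior(pi, theta0, theta1))
```

Working in log space avoids underflow when many scores are multiplied
together, which matters once the number of items reaches the thousands.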
Let us look at a real example. Figure 1A shows a histogram of the scores
obtained from a classifier on a public dataset (see Section 4.1 for more
information about the datasets we use). At first glance, it is difficult to
guess the performance of the classifier unless the oracle provides a lot of
labels. However, if we assume that the scores follow a two-component mixture
model as in (3), with a Gamma distribution for the $y_i = 0$ component and a
Normal distribution for the $y_i = 1$ component, then there is only a narrow
choice of $\theta$ that can explain the scores with high likelihood; the red
and green curves in Figure 1A show such a high-probability hypothesis. As we
will see in the next section, the posterior on $\theta$ can be used to
estimate the performance of the classifier.



3 Estimating performance 

Most performance measures can be computed directly from the model parameters
$\theta$. For example, two often used performance measures are the precision
$P(\tau; \theta)$ and recall $R(\tau; \theta)$ at a particular score threshold
$\tau$. We can define these quantities in terms of the conditional
distributions $p_y(s_i \mid \theta_y)$. Recall is defined as the fraction of
the positive, i.e. $y_i = 1$, examples that have scores above a given
threshold,

$$R(\tau; \theta) = \int_{\tau}^{\infty} p_1(s \mid \theta_1)\, ds. \tag{5}$$

Precision is defined to be the fraction of all examples with scores above a
given threshold that are positive,

$$P(\tau; \theta) = \frac{\pi R(\tau; \theta)}{\pi R(\tau; \theta) + (1 - \pi) \int_{\tau}^{\infty} p_0(s \mid \theta_0)\, ds}. \tag{6}$$

We can also compute the precision at a given level of recall by inverting
$R(\tau; \theta)$, i.e. $P_r(r; \theta) = P(R^{-1}(r; \theta); \theta)$ for
some recall $r$. Other performance measures, such as the equal error rate,
true positive rate, true negative rate, sensitivity, specificity, and the ROC
can be computed from $\theta$ in a similar manner.
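To make equations (5) and (6) concrete, the sketch below evaluates them in
closed form under the assumption that both class-conditionals are plain
(untruncated) Normal densities, for which the upper tail integral is a
complementary error function. The function names and the bisection bounds are
hypothetical choices for this illustration.

```python
import math


def normal_sf(tau, mu, sigma):
    """Upper tail probability P(s > tau) of a Normal(mu, sigma)."""
    return 0.5 * math.erfc((tau - mu) / (sigma * math.sqrt(2.0)))


def recall(tau, theta1):
    """Equation (5): mass of the positive component above tau."""
    return normal_sf(tau, *theta1)


def precision(tau, pi, theta0, theta1):
    """Equation (6): positives above tau over everything above tau."""
    tp = pi * normal_sf(tau, *theta1)
    fp = (1.0 - pi) * normal_sf(tau, *theta0)
    return tp / (tp + fp) if tp + fp > 0.0 else float("nan")


def precision_at_recall(r, pi, theta0, theta1, lo=-50.0, hi=50.0):
    """P_r(r) = P(R^{-1}(r)), inverting the recall by bisection.

    Recall decreases monotonically in tau, so bisection on [lo, hi] finds
    the threshold whose recall equals r (assuming it lies in that range).
    """
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if recall(mid, theta1) > r:
            lo = mid  # recall still too high: raise the threshold
        else:
            hi = mid
    return precision(0.5 * (lo + hi), pi, theta0, theta1)
```

For other component families the same structure applies, with `normal_sf`
replaced by the survival function of the chosen distribution.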

The posterior on $\theta$ can also be used to obtain confidence bounds on the
performance of the classifier. For example, for some choice of parameters
$\theta$, the precision and recall can be computed for a range of score
thresholds $\tau$ to obtain a curve (see solid curves in Figure 1B).
Similarly, given the posterior on $\theta$, the distribution of
$P(\tau; \theta)$ and $R(\tau; \theta)$ can be computed for a fixed $\tau$ to
obtain confidence intervals (shown as colored bands in Figure 1B). The same
reasoning can be applied to the precision-recall curve: for some recall $r$,
the distribution of precisions, found using $P_r(r; \theta)$, can be used to
compute confidence intervals on the curve (see Figure 1C).

While the approach of estimating performance based purely on the estimate of
$\theta$ works well in the limit when the number of data items $N \to \infty$,
it has some drawbacks when $N$ is small (on the order of $10^3$ to $10^4$) and
the dataset is unbalanced, in which case finite-sample effects come into play.
This is especially the case when the number of positive examples is very
small, say 10-100, in which case the performance curve will be very jagged.
Since the previous approach views the scores (and the associated labels) as a
finite sample from $p(S, Y \mid \theta)$, there will always be uncertainty in
the performance estimate. When all items have been labeled by the oracle, the
remaining uncertainty in the performance represents the variability in
sampling $(S, Y)$ from $p(S, Y \mid \theta)$. In practice, however, the
question that is often asked is, "What is our best guess for the classifier
performance on this particular test set?" In other words, we are interested in
the sample performance rather than the population performance. Thus, when the
oracle has labeled the whole test set, there should not be any uncertainty in
the performance; it can simply be computed directly from $(S, Y)$.

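To estimate the sample performance, we need to account for uncertainty in the
unlabeled items, $i \in \mathcal{U}_t$. This uncertainty is captured by the
distribution of the unobserved labels $Y'_t = \{y_i \mid i \in
\mathcal{U}_t\}$, found by marginalizing out the model parameters,

$$p(Y'_t \mid S, Y_t) = \int_{\Theta} p(Y'_t, \theta \mid S, Y_t)\, d\theta$$

$$\qquad = \int_{\Theta} p(Y'_t \mid S, \theta)\, p(\theta \mid S, Y_t)\, d\theta. \tag{7}$$

Here $\Theta$ is the space of all possible parameters. On the second line we
rely on the assumption of a mixture model to factor the joint probability
distribution on $\theta$ and $Y'_t$.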

One way to think of this approach is as follows: imagine that we sample $Y'_t$
from $p(Y'_t \mid S, Y_t)$. We can then use all the labels $Y = Y_t \cup Y'_t$
and the scores $S$ to trace out a performance curve (e.g., a precision-recall
curve). Now, as we repeat the sampling, each performance curve will look
slightly different. Thus, the posterior distribution on $Y'_t$ in effect gives
us a distribution of performance curves. We can use this distribution to
compute quantities such as the expected performance curve, the variance in the
curves, and confidence intervals. The main difference between the sample and
population performance estimates will be at the tails of the score
distribution, $p(S \mid \theta)$, where individual item labels can have a
large impact on the performance curve.

3.1 Sampling from the posterior 

In practice, we cannot compute $p(Y'_t \mid S, Y_t)$ in (7) analytically, so
we must resort to approximate methods. For some choices of class-conditional
densities, $p_y(s \mid \theta_y)$, such as when they are Normal distributions,
it is possible to carry out the marginalization over $\theta$ in (7)
analytically. In that case one could use collapsed Gibbs sampling to sample
from the posterior on $Y'_t$, as is often done for models involving the
Dirichlet process [14]. A more generally applicable method, which we will
describe here, is to split the sampling into three steps: (a) sample $\theta$
from $p(\theta \mid S, Y_t)$, (b) fix the mixture parameters to $\theta$ and
sample the labels $Y'_t$ given their associated scores, and (c) compute the
performance, such as precision and recall, for all score thresholds
$\tau \in S$. By repeating these three steps, we can obtain a sample from the
distribution over the performance curves.

The first step, sampling from the posterior $p(\theta \mid S, Y_t)$, can be
carried out using importance sampling (IS). We experimented with
Metropolis-Hastings and Hamiltonian Monte Carlo [TS], but we found that IS
worked well for this problem, required less parameter tuning, and was much
faster. In IS, we sample from a proposal distribution $q(\theta)$ in order to
estimate properties of the desired distribution $p(\theta \mid S, Y_t)$.
Suppose we draw $M$ samples of $\theta$ from $q(\theta)$ to get
$\hat{\Theta} = \{\theta^1, \ldots, \theta^M\}$. Then, we can approximate
expectations of some function $g(\cdot)$ using the weighted function
evaluations, i.e. $E[g] \approx \sum_{m=1}^{M} w_m g(\theta^m)$. The weights
$w_m \in W$ correct for the bias introduced by sampling from $q(\theta)$ and
are defined as,

$$w_m = \frac{p(\theta^m \mid S, Y_t) / q(\theta^m)}{\sum_{l=1}^{M} p(\theta^l \mid S, Y_t) / q(\theta^l)}. \tag{8}$$
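The weight computation can be sketched as follows, assuming the weights are
normalized to sum to one, so that the posterior only needs to be known up to a
constant. The helper names are hypothetical, and the inputs are the log target
and log proposal densities evaluated at each sample.

```python
import math


def normalized_weights(log_target, log_proposal):
    """Self-normalized importance weights, w_m proportional to p/q.

    log_target[m] is the (possibly unnormalized) log posterior at sample m
    and log_proposal[m] is the log proposal density at the same sample.
    """
    log_w = [lt - lq for lt, lq in zip(log_target, log_proposal)]
    m = max(log_w)
    # Subtract the maximum before exponentiating to avoid overflow.
    w = [math.exp(v - m) for v in log_w]
    z = sum(w)
    return [v / z for v in w]


def is_expectation(g, samples, weights):
    """Approximate E[g] by the weighted sum of g over the samples."""
    return sum(w * g(x) for x, w in zip(samples, weights))
```

Because the weights are normalized, any constant offset in either log density
cancels, which is why the unnormalized posterior from Bayes' rule is enough.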

For the datasets in this paper, we found that the state-space around the MAP
estimate¹ of $\theta$,

$$\theta^* = \arg\max_{\theta}\, p(\theta \mid S, Y_t), \tag{9}$$

was well approximated by a multivariate Normal distribution. Hence, for the
proposal distribution we used,

$$q(\theta) = \mathcal{N}(\theta \mid \mu_q, \Sigma_q). \tag{10}$$

To simplify things further, we used a diagonal covariance matrix, $\Sigma_q$.
The elements along the diagonal of $\Sigma_q$ were found by fitting a
univariate Normal locally to $p(\theta \mid S, Y_t)$ along each dimension of
$\theta$ while the other elements were fixed at their MAP estimates. The mean
of the proposal distribution, $\mu_q$, was set to the MAP estimate of
$\theta$.
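One possible reading of "fitting a univariate Normal locally" is a
curvature match: along each dimension, the standard deviation is chosen so
that the Normal has the same second derivative of the log density at the MAP
estimate as the posterior. The text does not specify the fitting procedure, so
the sketch below is an assumption, with hypothetical function names and a
finite-difference step `h` chosen for illustration.

```python
import math


def local_sigma(log_post, theta_map, dim, h=1e-3):
    """Standard deviation of a Normal matching the curvature along `dim`.

    Uses a central finite difference of the log posterior at the MAP
    estimate, holding every other coordinate fixed.
    """
    up = list(theta_map)
    dn = list(theta_map)
    up[dim] += h
    dn[dim] -= h
    curv = (log_post(up) - 2.0 * log_post(theta_map) + log_post(dn)) / (h * h)
    if curv >= 0.0:
        raise ValueError("log posterior is not locally concave along dim")
    # For a Normal, d^2/dx^2 log p = -1 / sigma^2.
    return math.sqrt(-1.0 / curv)


def diagonal_proposal(log_post, theta_map, h=1e-3):
    """Per-dimension standard deviations of the diagonal proposal."""
    return [local_sigma(log_post, theta_map, d, h)
            for d in range(len(theta_map))]
```

The proposal mean is then the MAP estimate itself and the covariance is the
diagonal matrix of the squared standard deviations.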

We now have all steps needed to estimate the performance of the classifier,
given the scores $S$ and some labels $Y_t$ obtained from the oracle:

1. Find the MAP estimate $\mu_q$ of $\theta$ using (9).

2. Fit a proposal distribution $q(\theta)$ to $p(\theta \mid S, Y_t)$ locally
around $\mu_q$.

3. Sample $M$ instances of $\theta$, $\hat{\Theta} = \{\theta^1, \ldots,
\theta^M\}$, from $q(\theta)$ and calculate the weights $w_m \in W$.

4. For each $\theta^m \in \hat{\Theta}$, sample the labels for
$i \in \mathcal{U}_t$ to get $Y'_t = \{Y'_{t,1}, \ldots, Y'_{t,M}\}$.

5. Estimate performance measures using the scores $S$, labels
$Y_{t,m} = Y_t \cup Y'_{t,m}$ and weights $w_m \in W$.
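Steps 4 and 5 of the procedure above can be sketched as follows, taking the
parameter samples and their weights as given. This is an illustration under
stated assumptions rather than the authors' code: the components are plain
Normal densities, each parameter sample is a tuple `(pi, theta0, theta1)`,
unlabeled items are marked with `None`, and all function names are
hypothetical.

```python
import math
import random


def normal_pdf(s, mu, sigma):
    """Density of a Normal(mu, sigma) at s."""
    z = (s - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def sample_labels(scores, labels, theta, rng):
    """Step 4: keep oracle labels, draw the others from p(y_i | s_i, theta)."""
    pi, theta0, theta1 = theta
    out = []
    for s, y in zip(scores, labels):
        if y is None:
            a = pi * normal_pdf(s, *theta1)
            b = (1.0 - pi) * normal_pdf(s, *theta0)
            p1 = a / (a + b) if a + b > 0.0 else pi
            y = 1 if rng.random() < p1 else 0
        out.append(y)
    return out


def precision_recall(scores, labels, tau):
    """Sample precision and recall at threshold tau.

    Convention assumed here: precision is 1.0 when nothing scores above
    tau, and recall is 0.0 when there are no positives.
    """
    tp = sum(1 for s, y in zip(scores, labels) if s > tau and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > tau and y == 0)
    pos = sum(labels)
    prec = tp / (tp + fp) if tp + fp > 0 else 1.0
    rec = tp / pos if pos > 0 else 0.0
    return prec, rec


def expected_performance(scores, labels, thetas, weights, tau, seed=0):
    """Step 5: importance-weighted mean precision and recall at tau."""
    rng = random.Random(seed)
    exp_p = exp_r = 0.0
    for theta, w in zip(thetas, weights):
        full = sample_labels(scores, labels, theta, rng)
        p, r = precision_recall(scores, full, tau)
        exp_p += w * p
        exp_r += w * r
    return exp_p, exp_r
```

Repeating `precision_recall` over a grid of thresholds for each sampled label
set traces out one performance curve per sample; weighted quantiles over those
curves would then give the confidence bands.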
4 Experiments



4.1 Datasets 

We surveyed the literature for published classifier scores with ground truth
labels. One such dataset that we found was the Caltech Pedestrian Dataset²
(CPD), for which both detector scores and ground truth labels are available
for a wide variety of detectors [5]. Moreover, the CPD website also has scores
and labels available, using the same detectors, for other pedestrian detection
datasets, such as the INRIA (abbr. INR) dataset [5].

We made use of the detections in the CPD and INR datasets as if they were
classifier outputs. To some extent, these detectors are in fact classifiers,
in that they use the sliding-window technique for object detection. Here,
windows are extracted at different locations and scales in the image, and each
window is classified using a pedestrian classifier (with the caveat that there
are often some extra post-processing steps carried out, such as non-maximum
suppression to reduce the number of false positive detections). For our
experiments, we show results on detectors and datasets chosen to highlight
both the advantages and drawbacks of using SPE. To make experiments go faster,
we sampled the datasets randomly to have between 800 and 2,000 items. See [8]
for references to all detectors.

To complement the pedestrian datasets, we also used a basic linear SVM
classifier and a logistic regression classifier on the "optdigits" (abbr. DGT)
and "sat" (SAT) datasets from the UCI Machine Learning Repository [ITJ. Since
both datasets are multiclass, but our method only handles binary
classification, we chose one category for $y = 1$ and grouped the others into
$y = 0$. Thus, each multiclass dataset was turned into multiple binary
datasets. Planned future work includes extending our approach to multiclass
classifiers. In the figures, the naming convention is as follows: "svm3" is
used to mean that the SVM classifier was used with category 3 in the dataset
being assigned to the $y = 1$ class, and "logres9" denotes that the logistic
regression classifier was used with category 9 being the $y = 1$ class, and so
on. The datasets had 1,800-2,000 items each.



4.2 Choosing class conditionals 

So far we have not discussed in detail which distribution families to use for
the class-conditional distributions $p_y(s \mid \theta_y)$. To find out which
parametric distributions are appropriate for modeling the score
class-conditionals, we took the classifier scores and split them into two
groups, one for $y_i = 0$ and one for $y_i = 1$. We used MLE to fit different
families of probability distributions (see Figure 3 for a list of
distributions) on 80% of the data (sampled randomly) in each group. We then
ranked the distributions by the log likelihood of the remaining 20% of the
data (given the MLE-fitted parameters). In total, we carried out this
procedure on 78 class conditionals from the different datasets and
classifiers.

¹ We used BFGS-B [4] to carry out the optimization. To avoid local maxima, we
used multiple starting points.

² Downloaded from http://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians/

Figure 3G shows the top-3 distributions that explained the class-conditional
scores with highest likelihood for a selection of the datasets and
classifiers. We found that the truncated Normal distribution was in the top-3
list for 48/78 dataset class-conditionals, and that the Gamma distribution was
in the top-3 list 53/78 times; at least one of the two distributions was
always in the top-3 list. Figure 3A-F show some examples of the fitted
distributions. In some cases (see Figure 3), a mixture model would have
provided a better fit than the simple distributions we tried. That said, we
found that truncated Normal and Gamma distributions were good choices for most
of the datasets.
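The ranking procedure described in this section (fit each family on a random
80% of one class's scores, then compare log likelihoods on the held-out 20%)
can be sketched as below. To stay dependency-free, the sketch only includes
two families with closed-form maximum-likelihood fits, the plain Normal and
the log-normal; it does not implement the truncated or Gamma fits used in the
experiments. The function names and the `FAMILIES` table are hypothetical, and
scores are assumed to be strictly positive so that the log-normal is defined.

```python
import math
import random


def fit_normal(xs):
    """Closed-form MLE of a Normal: sample mean and (biased) std."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, math.sqrt(var)


def normal_logpdf(x, mu, sigma):
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def fit_lognormal(xs):
    """MLE of a log-normal: Normal MLE on the log scores."""
    return fit_normal([math.log(x) for x in xs])


def lognormal_logpdf(x, mu, sigma):
    # Change of variables: the Jacobian of log(x) contributes -log(x).
    return normal_logpdf(math.log(x), mu, sigma) - math.log(x)


FAMILIES = {
    "normal": (fit_normal, normal_logpdf),
    "log-normal": (fit_lognormal, lognormal_logpdf),
}


def rank_families(scores, train_frac=0.8, seed=0):
    """Rank families by held-out log likelihood, best first."""
    rng = random.Random(seed)
    xs = list(scores)
    rng.shuffle(xs)
    n_train = int(train_frac * len(xs))
    train, test = xs[:n_train], xs[n_train:]
    results = []
    for name, (fit, logpdf) in FAMILIES.items():
        params = fit(train)
        held_out = sum(logpdf(x, *params) for x in test)
        results.append((held_out, name))
    results.sort(reverse=True)
    return [(name, ll) for ll, name in results]
```

Scoring on held-out data rather than on the training split keeps families
with more flexible shapes from being favored merely for fitting noise.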

Since we use a Bayesian approach in equation (4), we must also define a prior
on $\theta$. The prior will vary depending on which distribution is chosen,
and it should be chosen based on what we know about the data and the
classifier. As an example, for the truncated Normal distribution, we use a
Normal and a Gamma distribution as priors on the mean and standard deviation
respectively (since we use sampling for inference, we are not limited to
conjugate priors). As a prior on the mixture weight $\pi$, we use a Beta
distribution.

In some situations when little is known about the classifier, it makes sense
to try different kinds of class-conditional distributions. One heuristic,
which we found worked well in our experiments, is to try different
combinations of distributions for $p_0$ and $p_1$, and then choose the
combination achieving the highest maximum likelihood on the labeled and
unlabeled data.

4.3 Applying SPE 

Figure 2 shows SPE applied to different datasets. The left-most plots show the
estimation error, as measured by the area between the true and predicted
precision-recall curves, versus the number of labels sampled. The datasets in
Figure 2A-B and C-D were chosen to highlight the strengths and weaknesses of
using SPE. Figure 2A shows SPE applied to the ChnFtrs detector in the CPD
dataset. Already at 20 sampled labels, the estimate is very close to the
ground truth (see Figure 2B). In a few cases, e.g. in Figure 2C-D (logres8 on
the DGT dataset), SPE does not fare as well. While SPE performs as well as the
naive method in terms of estimation error, the score distribution is not well
explained by the assumptions of the model, so there is a bias in the
prediction. That said, despite the fact that SPE is biased in Figure 2C-D, it
is still far better than the naive method for 100 labels. Ultimately, the
accuracy of SPE depends on how well the score data fit the assumptions in
Section 2.

Figure 2E compares the estimation error of SPE to the naive method for
different datasets, when only 20 labels are known. In almost all cases, SPE
performs significantly better. Moreover, the variances in the SPE estimates
are smaller than those of the naive method.
4.4 Classifier recalibration



Applying SPE to a test dataset allows us to "recalibrate" the classifier to
that dataset. Unlike previous work on classifier calibration [TJ [T7], SPE
does not require all items to be labeled. For each unlabeled data item, we can
compute the probability that it belongs to the $y = 1$ class by calculating
the empirical expectation from the samples, i.e. $p(y_i = 1) =
E[y_i = 1 \mid S, Y_t]$.

Similarly, we can choose a threshold $\tau$ to use with the classifier
$h(x_i; \tau)$ based on some predetermined criteria. For example, the
requirement might be that the classifier performs with recall
$R(\tau) > \hat{r}$ and precision $P(\tau) > \hat{p}$. In that case, we define
a condition $C(\tau) = [R(\tau) > \hat{r} \wedge P(\tau) > \hat{p}]$. Then,
for each $\tau$, we find the probability that the condition is satisfied by
calculating the expectation $p(C(\tau) = 1) = E[C(\tau)]$ over the unlabeled
items $Y'_t$. Figure 4 shows the probability that $C(\tau)$ is satisfied at
different values of $\tau$. Thus, this approach can be used to choose new
thresholds for different datasets.
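The threshold selection just described reduces to a weighted count over the
sampled label sets. The sketch below assumes that, for each candidate
threshold, the precision and recall of every sampled label set have already
been computed, together with the importance weights. The function names, the
`(precision, recall)` tuple layout, and the argument names `r_min` and `p_min`
(standing in for the recall and precision requirements) are hypothetical.

```python
def condition_probability(perf_samples, weights, r_min, p_min):
    """Weighted fraction of samples meeting both requirements.

    perf_samples[m] is the (precision, recall) pair obtained from the m-th
    sampled label set at one fixed threshold; weights[m] is its normalized
    importance weight.
    """
    return sum(w for (p, r), w in zip(perf_samples, weights)
               if r > r_min and p > p_min)


def pick_threshold(taus, perf_by_tau, weights, r_min, p_min):
    """Return the threshold most likely to satisfy the condition.

    perf_by_tau[k] holds the per-sample (precision, recall) pairs for
    taus[k]. Ties are broken in favor of the earliest threshold.
    """
    probs = [condition_probability(perf_by_tau[k], weights, r_min, p_min)
             for k in range(len(taus))]
    best = max(range(len(taus)), key=lambda k: probs[k])
    return taus[best], probs[best]
```

Picking the most probable threshold is only one possible rule; a practitioner
could equally choose the lowest threshold whose probability exceeds some
required level.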

5 Related work 

Previous approaches for estimating classifier performance with few labels fall
into two categories: stratified sampling and active estimation using
importance sampling. Bennett and Carvalho [5] suggested that the accuracy of
classifiers can be estimated cost-effectively by dividing the data into
disjoint strata based on the item scores, and proposed an online algorithm for
sampling from the strata. This work has since been generalized to other
classifier performance metrics, such as precision and recall [5]. Sawade et
al. proposed instead to use importance sampling to focus labeling effort on
data items with high classifier uncertainty, and applied it to standard loss
functions [T3] and F-measures [TS]. While both of these approaches assume that
the classifier threshold $\tau$ is fixed (see Section 2) and that a single
scalar performance measure is desired, SPE can be applied to the tradeoff
between different performance measures in the form of performance curves.

Fitting mixture models to the class-conditional score distributions has been
studied in previous work with the goal of obtaining smooth performance curves.
Gu et al. [12] and Hellmich et al. [13] showed how a two-component Gaussian
mixture model can be used to obtain accurate ROC curves in different settings.
Erkanli et al. [TU] extended this work by fitting mixtures of Dirichlet
process priors to the class-conditional distributions. This allowed them to
provide smooth performance estimates even when the class-conditional
distributions could not be explained by standard parametric distributions.
Similarly, previous work on classifier calibration has involved fitting
mixture models to score distributions [TJ [T7]. In contrast to previous work,
which requires all data items to be labeled, SPE also makes use of the
unlabeled data. This semisupervised approach allows SPE to estimate classifier
performance with very few labels, or when the proportions of positive and
negative examples are very unbalanced.
6 Discussion



We explored the problem of estimating classifier performance from few labeled
items. We proposed using mixtures of two densities to model the scores of
classifiers. This allows us to predict performance curves even when a very
small number (none in the limit) of the samples are labeled. Using four public
datasets, and multiple classifiers, we showed that classifier score
distributions can often be well approximated by two-component mixture models
with standard parametric component distributions, such as truncated Normal and
Gamma distributions. We demonstrated how our model, Semisupervised Performance
Evaluation (SPE), can be used to estimate classifier performance, with
confidence intervals, using only a few labeled examples. We presented a
sampling scheme based on importance sampling for efficient inference.

This line of research opens up many interesting avenues for future
exploration. For example, is it possible to do unbiased active querying, so
that the oracle is asked to label the most informative examples? One
possibility in this direction would be to employ importance weighted active
sampling techniques [3] [7], similar in spirit to [T3J [TS] but for
performance curves. Another future direction would be to extend SPE to
multi-component mixture models and multiclass problems. That said, as shown by
our experiments, SPE already works well for a broad range of classifiers and
datasets, and can estimate classifier performance with as few as 10 labels
(see Figure 1).



Figure 2: Applying SPE to different datasets. A: Estimation error, as measured
by the area between the true and predicted precision-recall curves, versus the
number of labels sampled, for the ChnFtrs detector on the CPD dataset. The red
curve is SPE and the green curve shows the median error of the naive method
(RND). The green band shows the 90% quantiles of the naive method. B: The
performance curve estimated using SPE (red) with 90% confidence intervals
(magenta) with 20 known labels. The ground truth performance with all labels
known is shown as a black curve (GT), and the performance curve computed on 20
labels using the naive method from 5 random samples is shown in green (RND).
Notice that the curves (in green) obtained from different samples vary a lot
(although most predict perfect performance). C-D: Same as A-B, but for the
logres8 classifier on the DGT dataset (hand-picked as an example where SPE
does not work well). E: Comparison of estimation error (area between curves)
of SPE and the naive method for 20 known labels and different datasets. The
appearance of the markers denotes the dataset (each dataset has multiple
classifiers), and the lines indicate the standard error averaged over 10
trials. SPE almost always performs significantly better than the naive method.
Figure 3: Modeling class-conditional score densities by standard parametric
distributions. A-F: Standard parametric distributions $p_y(s \mid \theta_y)$
(black solid curve) fitted to the class-conditional scores for a few example
datasets and classifiers. The score distributions are shown as histograms. In
all cases, we normalized the scores to be in the interval $s_i \in (0, 1]$,
and made the truncation at $s = 0$ for the truncated distributions. See
Section 4.2 for more information. G: Comparison of standard parametric
distributions best representing empirical class-conditional score
distributions (for a subset of the 78 cases we tried). Each row shows the
top-3 distributions, i.e. explaining the class-conditional scores with highest
likelihood, for different combinations of datasets, classifiers and the
class-labels (shown in brackets, $y = 0$ or $y = 1$). The distribution
families we tried included (with abbreviations used in the last three columns
in parentheses) the truncated Normal (n), truncated Student's t (t), Gamma
(g), log-normal (ln), left- and right-skewed Gumbel (g-l and g-r), Gompertz
(gz), and Frechet right (f-r) distribution. The last and second to last
columns show the relative log likelihood (r.l.l.) with respect to the best
(1st) distribution. Two densities, truncated Normal and Gamma, are either top
or indistinguishable from the top in all the datasets we tried.
Figure 4: Recalibrating the classifier by estimating the probability that a
condition is met. A: The conditions in panel B shown as colored "boxes," e.g.
the yellow curve shows the condition that the precision P > 0.5 and recall
R > 0.5. The blue curve and confidence band show SPE applied to the ChnFtrs
detector on the CPD dataset with 100 observed labels (black curve is ground
truth). B: Probability that the conditions shown in A are satisfied for
different score thresholds. Based on a curve like this, a practitioner can
"recalibrate" a pre-trained classifier by picking a threshold for a new
dataset such that some pre-defined criteria (e.g. in terms of precision and
recall) are met.